GWU Data Science Datathon

DC Crash Data Analysis

Tanaya Kavathekar

4/4/2021

Abstract

Road accidents has been a critical problem as every year more than 1.2M people die across the globle. There is a pressing need to make use of the data and understand the underlying cause of problem. Road safety issues are complex. There are significant differences in policies within and across the countries. In this analysis, the data from Metropolitan Police Department's (MPD) crash data management system (COBALT) is studied to find relationship between fatality and independent features such as vehicle related information(vehicle type) and person related information (category, age, impaired, etc). The crash data is for DC state. The analysis shows that speed, age, vehicle type motor cycle and license region play a significant role in the fatality rate. Given the data features and observations, I have made two suggestions in policy based on speed and motor cycles. This is recomemndations are supported by the research paper from Transportation Research Record Journal and DC DMV rule books.

Pipeline

Data Preprocessing

In [43]:
data.describe()
Out[43]:
id crime id person id age
count 5.963810e+05 5.963810e+05 5.963810e+05 426744.000000
mean 4.384924e+08 2.672116e+07 8.506922e+07 38.668302
std 1.721813e+05 1.238390e+06 8.613766e+06 20.897059
min 4.370014e+08 2.341134e+07 1.045383e+07 -7990.000000
25% 4.383433e+08 2.532167e+07 8.474899e+07 27.000000
50% 4.384924e+08 2.680585e+07 8.497752e+07 37.000000
75% 4.386415e+08 2.769386e+07 8.712287e+07 51.000000
max 4.387906e+08 2.872803e+07 9.077153e+07 237.000000

Outlier treatment

Age variables has lot of outliers. From the box plot values below 0 and above 85 are considered as outliers and replaced by na

In [71]:
data['age'] = np.where(data['age']<1, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box", title="Age distribution in the dataset")

fig.show()

Missings values detection and treatment

Only age column shows 30% of missing data which is replaced by 0

Column Creation

  • Mask columns boolean fields fatal, impaired, speeding and ticket issued to 1s and 0s
  • Fatality Rate - at an accident level, number of people faced fatal accidents/number of total people involved in the accidents
  • Severity level - Club columns for minor, major and fatal accidents with indicator labels
  • Map US states into 4 regions designated by US census

EDA

In this analysis, fatal is the target variable. As indicated below the dataset is heavily imbalanced which is treated futher. Only 0.06% (~417) are fatal

In [52]:
print("Percentage of non fatal class {} and fatal class {} ".format(
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][1]/ data.shape[0])*100, 2)))

print("Percentage of non major class {} and major class {} ".format(
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][1]/ data.shape[0])*100, 2)))

print("Percentage of non minor class {} and minor class {} ".format(
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][1]/ data.shape[0])*100, 2)))
Percentage of non fatal class 99.93 and fatal class 0.07 
Percentage of non major class 96.42 and major class 3.58 
Percentage of non minor class 88.93 and minor class 11.07 

How many people involved in the accidents suffered injury?

Most of the accidents are non fatal but people meeting an accident and resulting into injury are 1 or sometimes 2

In [53]:
crime = pd.DataFrame(data[['crime id', 'severity']].value_counts())
crime = crime.reset_index()
crime.columns = ['crime id', 'severity', 'count of people']
fig = px.histogram(crime, x="count of people", color='severity', marginal="box", title="Number of people met with an accident along with severity level")
fig.show()

Which vehicle type has maximum accidents and most fatal accidents?

  1. Dataset is dominated by the accidents by passenger car/automobile. Which is natural because everyday there are more of number of cars than other vehicle type
  2. Hence maximum minor, major and fatal injuries are observed in passenger car/automobile
  3. Fatal injuries are followed by motor cyle, other vehicles and SUV(Sport Utility Vehicle)
In [54]:
vhCount = pd.DataFrame(data[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Which state people travel most to dc or within dc?

  1. As data belonged to DC region, naturally maximum license holders will belong to DC, Virginia and Maryland
  2. People from all over US is observed in the data
In [55]:
vhCount = pd.DataFrame(data[['License state']].value_counts())
vhCount
vhCount.reset_index(inplace=True) 
vhCount.columns = ['License state', 'count']

fig = go.Figure(data=go.Choropleth(
    locations=vhCount['License state'], # Spatial coordinates
    z = vhCount['count'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Accidents by State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Does the fatality rate of accidents changes for different state licenses?

Plot shows that people holding licenses from different states have higher fatality rate, this could be because of two reasons:

  1. People not belonging to DC may not familiar with the terrain
  2. People may not be acquainted with the different rules of State
In [56]:
stateFatality = data[data['fatality rate']>0.0000]
stateFatalityAvg = stateFatality.groupby(['License state'], as_index=False).agg({'fatality rate':'mean'})
stateFatalityAvg

fig = go.Figure(data=go.Choropleth(
    locations=stateFatalityAvg['License state'], # Spatial coordinates
    z = stateFatalityAvg['fatality rate'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Average fatality rate by State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Does age affect the fatality rate of an accident?

No evident pattern and correlation is observed between the fatality rate and age

In [57]:
# fatality rate and age
fig = px.scatter(data, x="age", y="fatality rate", color='severity')
fig.show()

Which category of the persons type met with maximum accidents and suffered maximum injury?

In [74]:
ptCount = pd.DataFrame(data[['person type', 'severity']].value_counts())
ptCount.reset_index(inplace=True) 
ptCount.columns = ['person type', 'severity','count']

# fatal = ptCount[ptCount['fatal'] == 'Y']
# nonfatal = ptCount[ptCount['fatal'] == 'N']
# # x = ptCount['person type']
# # y = ptCount['count']

fig = px.bar(ptCount, x='person type', y='count', color='severity',
             title="Number of people met with an accident by person type",
             template="simple_white", 
             category_orders={ # replaces default order by column name
                 "person type": ["Driver", "Passenger", "Pedestrian", "Bicyclist"]
            },)

percentage = pd.crosstab(data['person type'], data['severity']).apply(lambda r: (r/r.sum())*100, axis=1)

fig.add_annotation(text="Accidents % <br> Fatal: {},<br> Major: {}, <br> Minor: {}".format(
    (round(percentage['fatal']['Driver'], 2)),
    (round(percentage['major']['Driver'], 2)),
    (round(percentage['minor']['Driver'], 2))
    ), x="Driver", y=70000, showarrow=False)

fig.add_annotation(text="Accidents % <br> Fatal: {}, <br>Major: {},<br> Minor: {}".format(
    (round(percentage['fatal']['Passenger'], 2)),
    (round(percentage['major']['Passenger'], 2)),
    (round(percentage['minor']['Passenger'], 2))
    ), x="Passenger", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {},<br> Major: {},<br> Minor: {}".format(
    (round(percentage['fatal']['Pedestrian'], 2)),
    (round(percentage['major']['Pedestrian'], 2)),
    (round(percentage['minor']['Pedestrian'], 2))
    ), x="Pedestrian", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, <br>Major: {}, <br> Minor: {}".format(
    (round(percentage['fatal']['Bicyclist'], 2)),
    (round(percentage['major']['Bicyclist'], 2)),
    (round(percentage['minor']['Bicyclist'], 2))
    ), x="Bicyclist", y=70000, showarrow=False)

# fig.layout.update(showlegend=False) 
fig.show()
  1. Data contains 4 types of persons such as driver, passenger, pedestrian and bicyclist
  2. Driver contributes 77% of data entries followed by 19 % passenger, 2 % pedestrians and 0.6% bicyclist
  3. Although, driver has maximum entries only 0.05% accidents are fatal whereas 2.63% and 8.04% are major and minor
  4. Passenger suffered maximum minor injuries in the accidents.
  5. Pedestrian do not frequently meet with an accident but when occurred they have suffered 53% minor injuries, 24% major and 0.7% fatal injuries. They have maximum fatalities compared to other types
  6. Similarly, bicyclist have least entries of accidents but when met with an accident they suffered 60.82% of minor injuries 8% major injuries and 0.22% fatal injuries

Analysis according to person type

As very person type plays different role in transportation, the rules and policies are designed differently. Hence, lets dive deeper into person type to identify interesting patterns.

Driver

  1. Plot shows that maximum count of drivers in the age group 25 to 30
  2. Median of ages in the severity level is nearly same
  3. All 3 groups nearly follow the same distribution pattern
In [60]:
driver = driver[driver['age'] >1]
fig = px.histogram(driver, x="age", color='severity', marginal="box", title="Age distribution in driver")
fig.show()

Passenger

  1. In case of fatal accidents, passenger median age is 33 compared to other group which is around 27
  2. Passengers within age group below 16 and 22-26 have faced injuries
In [61]:
passenger = data[data['person type'] == 'Passenger']
passenger = passenger[passenger['age'] >1]
fig = px.histogram(passenger, x="age", color='severity', marginal="box", title="Age distribution in passenger")
fig.show()

Pedestrian

According to person type age group varies, hence it is an important variable to be considered during policy development

In [63]:
pedestrian = pedestrian[pedestrian['age'] >1]
fig = px.histogram(pedestrian, x="age", color='severity', marginal="box", title="Age distribution in pedestrian")
fig.show()

Bicyclist

In [64]:
Bicyclist = data[data['person type'] == 'Bicyclist']
Bicyclist = Bicyclist[Bicyclist['age'] >1]
fig = px.histogram(Bicyclist, x="age", color='severity', marginal="box", title="Age distribution in bicyclist")
fig.show()

Model Selection

Random Uniform Under Sampling

As the dataset is heavily imbalanced on class non fatal and vehicle type passenger car/automobile, random uniform under sampling is performed.

Relationship Strengths

In EDA the relationships which were visually observed are quantitied here.

  1. In case of numerical columns Pearson correlation is calculated.
  2. In case of categorical to measure the association Cramer's V is calculated.

  3. Vehicle type and fatal/fatality rate are highly associated

  4. Fatality rate is also associated with speeding and person type
In [67]:
dython.nominal.associations(a[['severity', 'fatal', 'fatality rate','vehicle type', 'ticket issued','Region', 'impaired','age', 'speeding', 'person type']], theil_u=True)
Out[67]:
{'corr':                severity     fatal  fatality rate  vehicle type  ticket issued  \
 severity       1.000000  1.000000       0.797309      0.070944       0.077593   
 fatal          1.000000  1.000000       0.797309      0.378625      -0.053937   
 fatality rate  0.797309  0.797309       1.000000      0.338547       0.021171   
 vehicle type   0.128911  0.378625       0.338547      1.000000       0.239998   
 ticket issued  0.077593 -0.053937       0.021171      0.239998       1.000000   
 Region         0.033607  0.204898       0.149957      0.177092       0.124463   
 impaired       0.084922  0.084922       0.069527      0.127536       0.014524   
 age            0.129449  0.081324       0.162448      0.168203       0.111451   
 speeding       0.225998  0.225935       0.283369      0.349592       0.156124   
 person type    0.097725  0.320938       0.254655      0.264039       0.252625   
 
                  Region  impaired       age  speeding  person type  
 severity       0.043696  0.084922  0.129449  0.225998     0.095440  
 fatal          0.204898  0.084922  0.081324  0.225935     0.320938  
 fatality rate  0.149957  0.069527  0.162448  0.283369     0.254655  
 vehicle type   0.418393  0.127536  0.168203  0.349592     0.468558  
 ticket issued  0.124463  0.014524  0.111451  0.156124     0.252625  
 Region         1.000000  0.081594  0.146141  0.079717     0.289970  
 impaired       0.081594  1.000000  0.007665 -0.021036     0.134036  
 age            0.146141  0.007665  1.000000  0.038538     0.405564  
 speeding       0.079717 -0.021036  0.038538  1.000000     0.179477  
 person type    0.386048  0.134036  0.405564  0.179477     1.000000  ,
 'ax': <matplotlib.axes._subplots.AxesSubplot at 0x24539230388>}

Multiple Linear Regression

fatality rate = 14.23 + 0.24age + 24speeding and so on

Independent features used are vehicle type, age, speeding and person type is used.

In [68]:
X = a[['vehicle type', 'age', 'speeding', 'person type']]
y = a[['fatality rate']]
X = pd.get_dummies(data = X)
X['age'] = X['age'].fillna(0)

import statsmodels.api as sm
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()
Out[68]:
OLS Regression Results
Dep. Variable: fatality rate R-squared: 0.203
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 10.37
Date: Sun, 04 Apr 2021 Prob (F-statistic): 4.59e-29
Time: 22:43:43 Log-Likelihood: -3878.4
No. Observations: 834 AIC: 7799.
Df Residuals: 813 BIC: 7898.
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 14.0541 4.095 3.432 0.001 6.016 22.092
age 0.2186 0.042 5.148 0.000 0.135 0.302
speeding 31.4987 4.093 7.697 0.000 23.466 39.532
vehicle type_Atv (all Terrain Vehicle) 33.5239 24.257 1.382 0.167 -14.090 81.138
vehicle type_Bus -16.8753 8.583 -1.966 0.050 -33.723 -0.028
vehicle type_Cargo Van -0.8294 14.218 -0.058 0.953 -28.737 27.078
vehicle type_Construction/industrial Equipment -22.8143 24.265 -0.940 0.347 -70.444 24.816
vehicle type_Drugs/ Narcotics 3.2260 7.854 0.411 0.681 -12.191 18.643
vehicle type_Firearms -6.7576 5.259 -1.285 0.199 -17.081 3.565
vehicle type_Large/heavy Truck -6.6042 10.276 -0.643 0.521 -26.775 13.566
vehicle type_Moped/scooter 22.8417 11.187 2.042 0.041 0.882 44.801
vehicle type_Motor Cycle 16.4461 5.034 3.267 0.001 6.565 26.327
vehicle type_None 7.3616 10.960 0.672 0.502 -14.151 28.874
vehicle type_Other Small/light Truck -6.2048 11.195 -0.554 0.580 -28.179 15.769
vehicle type_Other Vehicle -0.2324 4.035 -0.058 0.954 -8.153 7.688
vehicle type_Passenger Car/automobile -0.6742 3.153 -0.214 0.831 -6.862 5.514
vehicle type_Passenger Van -2.3356 5.186 -0.450 0.653 -12.516 7.844
vehicle type_Pickup Truck -4.9473 7.851 -0.630 0.529 -20.358 10.464
vehicle type_Suv (sport Utility Vehicle) -1.0700 4.931 -0.217 0.828 -10.748 8.608
person type_Bicyclist 5.9079 6.721 0.879 0.380 -7.286 19.101
person type_Driver -2.6048 4.401 -0.592 0.554 -11.243 6.033
person type_Passenger -1.3725 4.583 -0.299 0.765 -10.368 7.623
person type_Pedestrian 12.1235 7.988 1.518 0.129 -3.556 27.803
Omnibus: 205.803 Durbin-Watson: 1.005
Prob(Omnibus): 0.000 Jarque-Bera (JB): 422.701
Skew: 1.390 Prob(JB): 1.63e-92
Kurtosis: 5.107 Cond. No. 3.17e+17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.09e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Discussion

Multiple linear regression:

Model is built using age, speeding, vehicle type and person type variable. The explainability of the model as indicated by r2 is 20%. This means there are many other factors that play an integral role when it comes to predicting the fatality of an accident. But as indicated by F test, p value is lesser than 0.05 hence this model is significant than the intercept only model. Coefficient values indicate the strength of the variable contributing to the fatality rate. Associated p values for the coefficient shows that only age, speeding, vehicle type bus, motor cycle, SUV contributes and bicyclist to fatality rate. Rest of the variable do not statistically significantly contribute to the fatality rate. Of the significant variables, bus and suv contributes negatively to the fatality rate.

The significant coefficents can be translated as the mean of the fatality rate changes given a one unit change in the age while holding other variables as constants.

Speed: The effect is speed is on the fatality rate is much higher. Hence there is a need to device more stringent policies 3 As discussed in paper Speed and Safety by Hauer, given a change in speed one can measure the effect of fatalities and injuries. Currently, by DC DMV there are sets of rules in place related to speed such as the absolute speed limit is 55 miles per hour on interstate highways and 25 miles per hour on all other roads unless otherwise designated. In alleys, it is 15 miles per hour. On roads in school zones, the speed limit is 15 miles per hour when the school zone sign is flashing.4 There are cameras on the road to capture over speeding.

Moto cycle: From the data we also observed that the accidents from the motor cycles (21 coeff) are more fatal and significants. My suggestion is to develop a policy for different lane for motor cycles.

Random Forest

In [70]:
getRF(a)
F1 Score: 0.604

Discussion

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The forest is built on different sub samples as bootstrap=True. Tje trees are built with default parameters such as cretrion as gini, maximum depth of the tress, maximum features as square root of number of features, etc. The avoid overfitting 3 fold cross validation is performed. The score used is f1 score as the dataset is balanced by undersampling. F1 score gives harmonic mean of precision and recall. The F1 score of random forest model is 62.5%

According to the model, age is the most significant variable contributing 50% followed by vehicle type and region. Age plays a pivotal role in road accidents. Current rules are you must be at least 16 years old to get a DC DMV learner permit, and you must pass vision screening and knowledge tests5. As indicated earlier in the analysis, people belonging to different states have higher fatality rate. My suggestion for policy in this case, the person having license from different state should give a short knowledge and practical test in DC to convert out-of-state license. This will ensure that the person is well acquianted with DC rules and regulations. 6

Conclusion

  1. Analysis shows that age, motor cycle, license region and speed contributed significants to fatal and fatality rate.
  2. My policy suggestion related to motor cycle is to designate a separate lane to motor cyclist.
  3. And second suggestion is a short knowledge and practical test to convert out-of-state license.
  4. Lastly, promulgating and enforcing that deal with key risk factors and that are based on best practices are essential ingredients in achieving low fatality due to road accidents.